This section shows how to build a supermatrix by setting minimal requirements for gene content per taxon (OTU). This approach is better suited to small-scale analyses because it relies on manual decisions, whereas large-scale supermatrices are better constructed with the parameter-space and data exploration tools of ReproPhylo, which are not addressed in this section. First, let's load our Project with the trimmed alignments:
In [1]:
from reprophylo import *
pj = unpickle_pj('outputs/my_project.pkpj', git=False)
The main decision to make when building a supermatrix is which metadata will be used to indicate that sequences of several genes belong to the same OTU in the tree. Obvious candidates would be the species name (stored as 'source_organism' if we read a GenBank file), the sample ID, the voucher specimen and so on. Often, we need to modify the metadata in our Project so that it correctly reflects which sequences came from the same sample.
In the case of the Tetillidae.gb example file, sample IDs are stored either under 'source_specimen_voucher' or 'source_isolate'. In addition, identical voucher numbers are sometimes formatted differently for different genes.
In the file 'data/Tetillida_otus_corrected.csv', I have unified the columns 'source_specimen_voucher' and 'source_isolate' into a single column called 'source_otu' and also made sure that all the voucher specimen values are formatted uniformly:
In [2]:
from IPython.display import Image
Image('images/fix_otus.png', width = 400)
Out[2]:
Our Project has to be updated with the recent changes to the spreadsheet:
In [3]:
pj.correct_metadata_from_file('data/Tetillida_otus_corrected.csv')
Such fixes can also be done programmatically (see section 3.4).
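As a rough illustration of the kind of unification described above, the following sketch merges the two qualifier columns into 'source_otu' using plain Python dictionaries rather than the ReproPhylo API. The helper names (normalize_voucher, unify_otu), the sample rows and the formatting rule (upper-case, no spaces or hyphens) are assumptions made for this example; the real spreadsheet may need different rules.

```python
def normalize_voucher(value):
    """Uniformly format a voucher string (hypothetical rule for this sketch)."""
    return value.strip().upper().replace(' ', '').replace('-', '')


def unify_otu(row):
    """Return a copy of row with a 'source_otu' key taken from the voucher
    column, or from the isolate column when no voucher is recorded."""
    raw = row.get('source_specimen_voucher') or row.get('source_isolate') or ''
    new_row = dict(row)
    new_row['source_otu'] = normalize_voucher(raw)
    return new_row


# Made-up rows standing in for lines of the metadata spreadsheet
rows = [
    {'source_specimen_voucher': 'QM G-12345', 'source_isolate': ''},
    {'source_specimen_voucher': '', 'source_isolate': 'qmg12345'},
]
unified = [unify_otu(r) for r in rows]
```

After this step both rows carry the same 'source_otu' value, so sequences of different genes from the same sample can be recognized as one OTU.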
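Supermatrices are configured with objects of the class Concatenation. In a Concatenation object we can indicate the following: a unique name for the concatenation; the loci to include (Locus objects rather than just Locus names); the qualifier whose values identify sequences that belong to the same OTU; the loci that every OTU must have; groups of loci of which every OTU must have at least one; and which trimmed alignment to use for a locus if more than one exists in the Project.
Here is an example: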
In [4]:
concat = Concatenation('large_concat',  # Any unique string
                       pj.loci,         # This is a list of Locus objects
                       'source_otu',    # The values of this qualifier flag
                                        # sequences that belong to the same
                                        # sample
                       otu_must_have_all_of=['MT-CO1'],        # All the OTUs must have a cox1 sequence
                       otu_must_have_one_of=[['18s', '28s']],  # All the OTUs must have either 18s or 28s or both
                       define_trimmed_alns=[]  # We only have one alignment per gene,
                                               # so the list is empty (default value)
                       )
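To make the two gene content rules concrete, here is a minimal sketch of the selection logic in plain Python. It reflects my reading of the comments on otu_must_have_all_of and otu_must_have_one_of, not ReproPhylo's actual implementation; the function otu_passes and the OTU names are hypothetical.

```python
def otu_passes(loci_present, must_have_all_of, must_have_one_of):
    """Return True if an OTU with the given set of loci meets both rules."""
    # Rule 1: every locus in must_have_all_of is present
    has_all = all(locus in loci_present for locus in must_have_all_of)
    # Rule 2: at least one locus from each group in must_have_one_of is present
    has_one_of_each = all(
        any(locus in loci_present for locus in group)
        for group in must_have_one_of
    )
    return has_all and has_one_of_each


# Made-up OTUs and the loci for which each one has a sequence
otus = {
    'otu_A': {'MT-CO1', '18s', '28s'},
    'otu_B': {'MT-CO1', '28s'},
    'otu_C': {'18s', '28s'},   # no cox1 sequence
    'otu_D': {'MT-CO1'},       # neither 18s nor 28s
}
kept = sorted(
    name for name, loci in otus.items()
    if otu_passes(loci, ['MT-CO1'], [['18s', '28s']])
)
```

Under these rules only otu_A and otu_B would enter the supermatrix; otu_C lacks cox1 and otu_D lacks both ribosomal genes.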
If we print this Concatenation object we get this message:
In [5]:
print concat
Building the supermatrix takes two steps. First, we need to mount the Concatenation object onto the Project, where it will be stored in the list pj.concatenations. Second, we need to construct the MultipleSeqAlignment object, which will be stored in the pj.trimmed_alignments dictionary, under the key 'large_concat' in this case:
In [6]:
pj.add_concatenation(concat)
pj.make_concatenation_alignments()
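Conceptually, the second step joins the per-locus alignments end to end for each OTU. The sketch below illustrates the idea with plain strings and dictionaries; it is not how ReproPhylo builds its MultipleSeqAlignment object. The function concatenate, the toy sequences and the choice of '?' as padding for a missing locus are assumptions made for this example.

```python
def concatenate(alignments, locus_order, otus, missing='?'):
    """Join per-locus aligned sequences for each OTU, padding absent loci."""
    supermatrix = {}
    for otu in otus:
        parts = []
        for locus in locus_order:
            aln = alignments[locus]
            # All rows of one alignment share the same length
            length = len(next(iter(aln.values())))
            parts.append(aln.get(otu, missing * length))
        supermatrix[otu] = ''.join(parts)
    return supermatrix


# Toy trimmed alignments: locus name -> {OTU: aligned sequence}
alignments = {
    'MT-CO1': {'otu_A': 'ATG-CA', 'otu_B': 'ATGGCA'},
    '28s': {'otu_A': 'GGTT'},
}
matrix = concatenate(alignments, ['MT-CO1', '28s'], ['otu_A', 'otu_B'])
```

Every row of the result has the same length (the sum of the locus alignment lengths), which is what allows the supermatrix to be treated as a single alignment.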
In [8]:
pickle_pj(pj, 'outputs/my_project.pkpj')
Out[8]:
In [ ]:
# Design a supermatrix
concat = Concatenation('concat_name', loci_list, 'otu_qualifier', **kwargs)
# Add it to a project
pj.add_concatenation(concat)
# Build supermatrices based on the Concatenation
# objects in pj.concatenations
pj.make_concatenation_alignments()